
Conversation

@Zijie-Tian

Make sure to read the contributing guidelines before submitting a PR

Zijie Tian and others added 30 commits May 10, 2025 23:42
- Introduced `run-prefill-decode-bench.sh` for executing prefill-decode benchmarks with customizable parameters.
- Added `extract_bench_results.py` to process benchmark markdown files and extract structured results into CSV format (a sketch of this extraction follows below).
- Updated `.gitignore` to include `bench_results` directory for generated files.
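
A minimal sketch of the kind of markdown-to-CSV extraction `extract_bench_results.py` performs; the pipe-delimited table layout assumed here matches what `llama-bench` emits, and the file paths are placeholders.

```python
import csv
import sys

def parse_markdown_table(path):
    """Collect rows from pipe-delimited markdown tables in a file."""
    rows = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip non-table lines and |---|---| separator rows.
            if not line.startswith("|") or set(line) <= {"|", "-", ":", " "}:
                continue
            rows.append([cell.strip() for cell in line.strip("|").split("|")])
    return rows

if __name__ == "__main__":
    with open(sys.argv[2], "w", newline="") as out:
        csv.writer(out).writerows(parse_markdown_table(sys.argv[1]))
```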
- Introduced `analyze_benchmark_results.py` to process benchmark CSV files and generate performance pivot tables (illustrated below).
- Updated `run-prefill-decode-bench.sh` to support multiple KV cache types and added options for prompt length and forced alignment.
- Modified `extract_bench_results.py` to accommodate broader file matching for markdown files.
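
For the pivot-table step, a hedged sketch using pandas; the column names (`kv_type`, `n_prompt`, `tokens_per_second`) are illustrative, not the script's actual schema.

```python
import pandas as pd

# Illustrative schema; the real CSV columns come from extract_bench_results.py.
df = pd.read_csv("bench_results/results.csv")

# One row per KV cache type, one column per prompt length,
# cells holding mean tokens/second.
pivot = df.pivot_table(index="kv_type", columns="n_prompt",
                       values="tokens_per_second", aggfunc="mean")
print(pivot.round(2).to_string())
```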
- Introduced `run_op_bench.sh` to execute Flash Attention benchmarks with customizable parameters for head sizes, KV lengths, and quantization types.
- Added `summary_flash_attn.py` to process benchmark results, extract performance metrics, and generate analysis summaries (a throughput sketch follows below).
- Enhanced test cases in `test-backend-ops.cpp` to include additional KV lengths and quantization types for comprehensive performance evaluation.
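
One plausible metric such a summary script can derive is throughput from measured op timings. A sketch under the usual flash-attention FLOP count (two matmuls, QKᵀ and PV); all numbers here are illustrative.

```python
def flash_attn_gflops(n_q, n_kv, head_dim, n_head, time_us):
    """Approximate flash-attention throughput from a measured op time."""
    # QK^T and softmax(QK^T)V each cost ~2*n_q*n_kv*head_dim FLOPs per head.
    flops = 4 * n_q * n_kv * head_dim * n_head
    return flops / (time_us * 1e3)  # FLOPs / (us * 1e-6) / 1e9 = GFLOP/s

# Single-token decode against a 4096-entry KV cache, 32 heads of dim 128.
print(f"{flash_attn_gflops(1, 4096, 128, 32, 250.0):.1f} GFLOP/s")
```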
- Introduced a new profiling feature for the ggml library to track operation timings within computation graphs.
- Added `ggml-profile.h` and `ggml-profile.cpp` to define profiling structures and functions.
- Updated `CMakeLists.txt` to include options for enabling the graph profiler.
- Modified existing source files to integrate profiling calls during graph computations, allowing for performance analysis.
- Enhanced `CMakePresets.json` with new presets for profiling builds.
- Added a function to enable or disable GGML graph profiling based on a specified path.
- Updated the `test_gen` function to conditionally set profiling during the last generation iteration.
- Ensured profiling is reset after each benchmark run in the main function.
- Improved overall profiling integration for better performance analysis during benchmarks.
- Updated the output format in `ggml-profile.cpp` to CSV for easier reading and parsing.
- Introduced a global variable in `llama-bench.cpp` to manage the GGML_GRAPH_PROFILE setting, allowing dynamic configuration.
- Added a function to read the current value of the GGML_GRAPH_PROFILE environment variable, making profiling setup more flexible (see the sketch below).
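
The environment variable `GGML_GRAPH_PROFILE` is named in this PR; the profile CSV schema is not, so the column names below are placeholders. A sketch of driving a profiled run and aggregating per-op timings:

```python
import csv
import os
import subprocess

# GGML_GRAPH_PROFILE comes from this PR; the binary path, model path, and
# the CSV column names ("op", "time_us") are placeholders.
env = dict(os.environ, GGML_GRAPH_PROFILE="profile.csv")
subprocess.run(["./llama-bench", "-m", "model.gguf"], env=env, check=True)

totals = {}
with open("profile.csv") as f:
    for row in csv.DictReader(f):
        totals[row["op"]] = totals.get(row["op"], 0.0) + float(row["time_us"])

for op, t in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{op:>20}: {t / 1e3:.2f} ms")
```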
- Introduced `run-breakdown.sh` to facilitate operator breakdown profiling with customizable parameters such as model path, thread count, output directory, and prefill depths.
- Updated `.gitignore` to exclude specific breakdown results files.
- Enhanced `llama-bench.cpp` to support profiling during prefill and decode operations, improving performance analysis capabilities.
- Introduced `analyze_breakdown.py` to parse CSV files, analyze operator performance, and generate visualizations.
- Implemented data cleaning, operator analysis, and bar and pie chart visualization functions (sketched below).
- Added command-line interface for processing multiple CSV files or a specific file, with options for generating comparison charts across depths.
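
A hedged sketch of the bar/pie visualization step, assuming a per-op timing CSV with illustrative column names:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative schema and path; analyze_breakdown.py's real columns may differ.
df = pd.read_csv("breakdown_results/depth_512.csv")
per_op = df.groupby("op")["time_us"].sum().sort_values(ascending=False)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
per_op.head(10).plot.bar(ax=ax1, title="Top-10 ops by total time (us)")
per_op.plot.pie(ax=ax2, autopct="%.1f%%", title="Share of total time")
ax2.set_ylabel("")
fig.tight_layout()
fig.savefig("breakdown_summary.png")
```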
- Introduced `SKIP_ANALYSIS` flag to allow users to skip the data analysis step during profiling.
- Updated help information to include the new flag and its default value.
- Added a function that checks for required Python dependencies and warns when they are missing (see the sketch below).
- Adjusted the output so that results are displayed according to the new flag.
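
The dependency check itself lives in the shell script; a Python equivalent of the same warn-and-continue behavior might look like this (the package list is an assumption):

```python
import importlib.util

# Assumed package list; warn instead of failing, matching SKIP_ANALYSIS semantics.
REQUIRED = ["pandas", "matplotlib"]
missing = [m for m in REQUIRED if importlib.util.find_spec(m) is None]
if missing:
    print("warning: missing Python packages: " + ", ".join(missing)
          + "; skipping the analysis step")
```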
- Added T-MAC quantization types and configurations to the ggml library (the table-lookup idea behind these types is sketched below).
- Enhanced the `convert_hf_to_gguf.py` script to support T-MAC options and quantization configurations.
- Updated CMake files to include T-MAC compilation options and source files.
- Introduced new utility functions for T-MAC handling in the gguf Python module.
- Modified existing quantization logic to accommodate T-MAC types and ensure compatibility with the new formats.
- Improved model loading and tensor operations to leverage T-MAC optimizations.
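
T-MAC's core idea is table-lookup matrix multiplication over low-bit weights. The actual kernels are C/C++ and not shown in this PR; the numpy round-trip below only illustrates the bit-packing and lookup-table notion behind such quant types.

```python
import numpy as np

def pack_2bit(q):
    """Pack four 2-bit values (0..3) per byte."""
    q = q.reshape(-1, 4)
    return (q[:, 0] | (q[:, 1] << 2) | (q[:, 2] << 4) | (q[:, 3] << 6)).astype(np.uint8)

def unpack_2bit(packed):
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return ((packed[:, None] >> shifts) & 0x3).reshape(-1)

w = np.random.randint(0, 4, size=256)
assert np.array_equal(unpack_2bit(pack_2bit(w)), w)

# LUT idea: precompute activation * {0, 1, 2, 3} once, then the dot
# product reduces to table lookups and adds instead of multiplies.
act = np.float32(0.37)
lut = act * np.arange(4, dtype=np.float32)
partial_sum = lut[unpack_2bit(pack_2bit(w))].sum()
```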
- Added T-MAC quantization types and validation in ggml.h and ggml-quants.c.
- Updated type traits and tensor size calculations in ggml.c to accommodate T-MAC types.
- Enhanced CMake configuration to conditionally include T-MAC source files based on compilation flags.
- Modified llama model loader and quantization logic to support T-MAC types.
- Ensured compatibility and proper handling of T-MAC types across various components.
- Adjusted the T-MAC type count in ggml.h to reflect the correct number of types based on compilation flags.
- Updated CMakeLists.txt to ensure proper inclusion of T-MAC definitions and directories, removing unnecessary comments for clarity.
- Introduced a new test file, `test-quantize-accuracy.cpp`, to measure the round-trip accuracy of quantization and dequantization (see the numpy analogue below).
- Updated `CMakeLists.txt` to include the new accuracy test in the build process, ensuring comprehensive testing of quantization functionalities.
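
A numpy analogue of the round-trip error such an accuracy test measures, using a simple symmetric 8-bit scheme for illustration (the C++ test exercises ggml's real quant types):

```python
import numpy as np

def quantize_q8(x):
    """Symmetric 8-bit quantization: one scale for the whole block."""
    scale = max(np.abs(x).max() / 127.0, 1e-12)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.random.randn(4096).astype(np.float32)
q, scale = quantize_q8(x)
x_hat = q.astype(np.float32) * scale

rmse = np.sqrt(np.mean((x - x_hat) ** 2))
print(f"RMSE = {rmse:.6f}, max abs err = {np.abs(x - x_hat).max():.6f}")
```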
- Introduced a new option for QlutAttn in CMake configuration to enable its usage.
- Updated CMakeLists.txt to conditionally compile QlutAttn related definitions and include directories.
- Enhanced the ggml-base target to support QlutAttn functionality, ensuring proper integration within the library.
- Added additional T-MAC quantization types to the kv_cache_types in arg.cpp.
- Updated ggml.h to reflect the correct count of T-MAC types without conditional compilation.
- Enhanced llama-graph.cpp to support new T-MAC types in the attention mechanism, ensuring compatibility with existing functionality.
- Introduced a new example `flash-attn-inspector` to demonstrate the usage of flash attention in LLaMA models.
- Added corresponding CMake configuration to include the new example in the build process.
- Implemented the main functionality in `flash-attn-inspector.cpp`, including tensor data handling and debug logging (a reference-attention sketch follows below).
- Enhanced the testing framework with a new test target for evaluating callback functionality during inference.
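
For inspecting flash-attention outputs, a plain reference implementation is the natural ground truth to compare against. A numpy sketch (shapes and the causal mask construction are illustrative):

```python
import numpy as np

def attention_ref(Q, K, V, mask=None):
    """Plain softmax(Q K^T / sqrt(d) + mask) V, for comparison with flash attention."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # [n_q, n_kv]
    if mask is not None:
        scores = scores + mask
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                                     # [n_q, d_v]

n_q, n_kv, d = 4, 8, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal(s, dtype=np.float32) for s in [(n_q, d), (n_kv, d), (n_kv, d)])
# Causal mask: query i may attend to kv positions j <= i + (n_kv - n_q).
causal = np.triu(np.full((n_q, n_kv), -np.inf, dtype=np.float32), k=n_kv - n_q + 1)
out = attention_ref(Q, K, V, causal)
```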
- Added entries for `breakdown_results` and `breakdown_results_llamacpp` directories to the .gitignore file, ensuring that generated files from breakdown profiling are excluded from version control.
- Updated `.gitignore` to include `breakdown_results_llamacpp/` directory.
- Added documentation files for `ggml_structure.mdc` and `project_structure.mdc` to provide an overview of the project and its components.
- Introduced `python_scripts.mdc` to outline the usage of Python scripts within the project.
- Added new test files: `test-flash-attn.cpp` and `test-mul-mat.cpp` to validate the functionality of flash attention and matrix multiplication operations.
- Updated `CMakeLists.txt` to include new test targets for improved testing coverage.
- Introduced `ggml_cpu_structure.mdc` to detail the CPU-specific implementation of the GGML tensor library, including core source files, operation implementations, and architecture-specific optimizations.
- Updated `ggml_structure.mdc` to reference the new CPU backend documentation, enhancing overall project clarity.
Zijie-Tian and others added 22 commits June 19, 2025 05:37
…ith quantized tensor support and improve computation graph handling
…flash_attn_ext_mixed-function

Fix mask indexing in mixed flash attention and correct Q initialization
…efill-test-and-align_kv-mixed.sh

Fix causal mask padding in flash decoding test
…ng and adding layer-wise K/V quantization operations. Improved logging for debugging and computation graph handling.
…0-quantization-007b

Modify custom op for Q4_0 quantization
…ard-flash-attn-ext-f16-function-7321

Modify ggml_compute_forward_flash_attn_ext_f16 function
Implemented a function that converts ggml tensors to torch tensors using type traits, with support for the various tensor types. The dequantization function now uses type traits for float conversion and reports errors for unsupported types. This improves integration with PyTorch and simplifies tensor management (a hedged sketch follows below).
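
A Python-side sketch of that conversion path; `ggml_to_torch` and its type handling are hypothetical stand-ins for the type-trait dispatch done in C++:

```python
import numpy as np
import torch

def ggml_to_torch(raw_bytes: bytes, ggml_type: str, shape: tuple) -> torch.Tensor:
    """Hypothetical helper: convert raw ggml tensor data, wrap as torch."""
    if ggml_type == "f32":
        arr = np.frombuffer(raw_bytes, dtype=np.float32)
    elif ggml_type == "f16":
        arr = np.frombuffer(raw_bytes, dtype=np.float16).astype(np.float32)
    else:
        # Quantized types would dispatch to a dequantization routine here.
        raise NotImplementedError(f"unsupported ggml type: {ggml_type}")
    return torch.from_numpy(arr.copy()).reshape(shape)

t = ggml_to_torch(np.arange(6, dtype=np.float32).tobytes(), "f32", (2, 3))
```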
…sh-attn-ext-f16-febc

Fixed ggml_compute_forward_flash_attn_ext_f16_with_state
@Zijie-Tian closed this Jun 20, 2025
The github-actions bot added labels on Jun 20, 2025: documentation (improvements or additions to documentation), build (compilation issues), script (script related), testing (everything test related), examples, python (python script changes), and ggml (changes relating to the ggml tensor library for machine learning).